Objective: Two experiments were conducted in which participants navigated a simulated
unmanned aerial vehicle (UAV) through a series of mission legs while searching
for  targets  and  monitoring  system  parameters.  The  goal  of  the  study  was  to  highlight 
the  qualitatively  different  effects  of  automation  false  alarms  and  misses  as  they  relate 
to  operator  compliance  and  reliance,  respectively.  Background:  Background  data 
suggest  that  automation  false  alarms  cause  reduced  compliance,  whereas  misses  cause 
reduced  reliance.  Method:  In  two  studies,  32  and  24  participants,  including  some 
licensed  pilots,  performed  in-lab  UAV  simulations  that  presented  the  visual  world  and 
collected  dependent  measures.  Results:  Results  indicated  that  with  the  low-reliability 
aids,  false  alarms  correlated  with  poorer  performance  in  the  system  failure  task, 
whereas  misses  correlated  with  poorer  performance  in  the  concurrent  tasks.  Conclu¬ 
sion:  Compliance  and  reliance  do  appear  to  be  affected  by  false  alarms  and  misses, 
respectively,  and  are  relatively  independent  of  each  other.  Application:  Practical 
implications  are  that  automated  aids  must  be  fairly  reliable  to  provide  global  benefits 
and  that  false  alarms  and  misses  have  qualitatively  different  effects  on  performance. 


INTRODUCTION 

Unmanned  aerial  vehicles  (UAVs)  are  now 
commonly  used  to  fulfill  military  reconnaissance 
missions  without  endangering  human  pilots.  The 
current  study  considered  the  role  of  imperfect 
automation  in  buffering  multitask  interference,  as 
a  single  UAV  pilot  may  be  called  upon  to  perform 
the  multiple  tasks  required  of  UAV  supervision 
and  control. 

Imperfect  Automation 

Previously,  Dixon,  Wickens,  and  Chang  (2005) 
employed  a  perfectly  reliable  auditory  autoalert 
system  to  aid  pilots  in  detecting  system  failures 
during  simulated  military  reconnaissance  mis¬ 
sions,  and  they  found  that  these  autoalerts  im¬ 
proved  performance  in  the  automated  task  with  no 
performance  loss  in  either  of  two  concurrent  tasks. 
Unfortunately,  these  types  of  alerting  aids  are 
rarely entirely reliable; consequently, questions
arise as to the effect of unreliable automation on
pilot  trust,  dependence,  and  human-automation 
performance.  Imperfect  automation  has  been 
shown  to  create  different  states  of  overtrust,  under¬ 
trust,  or  calibrated  trust  (Parasuraman  &  Riley, 
1997),  “complacency”  (Metzger  &  Parasuraman, 
2005;  Parasuraman,  Molloy,  &  Singh,  1993),  and 
performance  loss  (Molloy  &  Parasuraman,  1996). 

In  spite  of  such  reported  problems,  imperfect 
automation clearly can assist human operator performance
(e.g., Galster, Bolia, Roe, & Parasuraman,
2001; St. John & Manes, 2002; Yeh, Merlo,
Wickens,  &  Brandenburg,  2003),  particularly  in 
circumstances  when  human  resources  to  the 
unaided  task  are  insufficient  (e.g.,  Maltz  &  Shinar, 
2003;  Yaacov,  Maltz,  &  Shinar,  2003)  and,  there¬ 
fore,  the  human  must  depend  upon  the  automa¬ 
tion.  Such  resource  scarcity  may  result  either  when 
the  task  itself  is  difficult  (Maltz  &  Shinar,  2003)  or 
when  the  automated  task  is  carried  out  in  a  multi¬ 
task  context  (C.  D.  Wickens  &  Dixon,  2005). 

Diagnostic Failures: Misses and False Alarms

The  focus  of  the  current  study  was  on  imperfect 
automation  diagnostic  alerting  systems,  in  which 
the  automation  attempted  to  distinguish  two  pos¬ 
sible  states  of  the  world:  a  “safe”  state  and  a  “dan¬ 
gerous”  one  (Swets  &  Pickett,  1982).  The  sources 
of  imperfection  in  such  systems  relate  to  imper¬ 
fect  sensors  and  algorithms  as  well  as  to  noisy  or 
probabilistic  data  in  an  uncertain  world.  The  per¬ 
formance  of  such  systems  can  generally  be  rep¬ 
resented  in  the  framework  of  signal  detection 
theory  (Green  &  Swets,  1988;  T.  D.  Wickens, 
2002),  whereby  the  consequences  of  the  imper¬ 
fection  show  up  as  automation  misses  and/or  false 
alarms. 

In  application,  the  automation  designer  typi¬ 
cally  has  the  opportunity  to  set  “beta”  (the  thresh¬ 
old  of  the  alerting  system)  in  a  way  that  will  trade 
off  the  relative  frequency  of  these  two  kinds  of 
automation  errors.  At  issue  is  where  this  trade-off 
should  optimally  be  set.  If  the  output  of  the  auto¬ 
matic  diagnostic  process  directly  triggers  a  deci¬ 
sion,  then  the  optimal  criterion  could  easily  be 
calculated  by  applying  some  expected  value  al¬ 
gorithm  to  the  consequences  of  the  two  sorts  of 
resulting  actions.  However,  this  process  becomes 
complicated  when  the  human  operator  also  has 
parallel  access  to  the  same  perceptual  “raw  data” 
processed  by  the  automation,  bringing  qualita¬ 
tively  different  strengths  of  perceptual  analysis  to 
bear.  Here  the  optimal  setting  may  vary  (Sorkin 
&  Woods,  1985).  In  such  cases,  an  automation 
miss  may  not  inevitably  create  a  total  system  miss 
if  the  human  is  somewhat  vigilant  of  the  raw  data. 
Furthermore,  in  those  multitask  situations  in  which 
automation  dependence  is  critical  because  of  high 
workload,  the  costs  to  total  system  performance 
must  also  account  for  the  costs  (of  automation 
misses  and/or  automation  false  alarms)  to  human 
performance  on  concurrent  tasks. 
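
As an illustration of the expected value calculation mentioned above, the following Python sketch computes the optimal beta under the standard equal-variance Gaussian signal detection model and shows how a given beta translates into hit and false alarm rates. The function names, the payoff parameters, and the example values are assumptions introduced here for illustration; they are not taken from the present experiments.

```python
from math import log
from statistics import NormalDist


def optimal_beta(p_signal, value_hit, cost_miss, value_cr, cost_fa):
    """Likelihood-ratio criterion (beta) that maximizes expected value.

    Costs are entered as positive magnitudes. All parameter names are
    hypothetical labels for the textbook signal detection formula.
    """
    p_noise = 1.0 - p_signal
    return (p_noise / p_signal) * (value_cr + cost_fa) / (value_hit + cost_miss)


def alert_rates(d_prime, beta):
    """Hit and false alarm rates for an equal-variance Gaussian detector.

    Assumes noise ~ N(0, 1) and signal ~ N(d_prime, 1).
    """
    std_normal = NormalDist()
    # Location on the evidence axis where the likelihood ratio equals beta.
    criterion = log(beta) / d_prime + d_prime / 2.0
    hit_rate = 1.0 - std_normal.cdf(criterion - d_prime)
    fa_rate = 1.0 - std_normal.cdf(criterion)
    return hit_rate, fa_rate


# Illustrative values only: equal payoffs, dangerous state on 10% of trials.
beta = optimal_beta(p_signal=0.1, value_hit=1, cost_miss=1, value_cr=1, cost_fa=1)
hit_rate, fa_rate = alert_rates(d_prime=2.0, beta=beta)
print(f"beta={beta:.2f} hits={hit_rate:.3f} false alarms={fa_rate:.3f}")
```

Under these assumptions, a rarer dangerous state raises the optimal beta, so the alert produces fewer false alarms at the price of more misses. The sketch covers only the case in which the automation output directly triggers the decision; it does not model a human operator with parallel access to the raw data.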

There  is  some  evidence  that  the  generic  costs  of 
alerting  system  false  alarms  may  be  greater  than 
those  of  misses.  For  example,  Bliss  (2003)  found 
that  pilots  reported  more  than  twice  as  many  alert- 
related  aviation  incidents  related  to  false  alarms 
as  compared  with  those  related  to  misses  (although 
this  disparity  may  reflect  a  higher  base  rate  of  false 
alert  events).  Maltz  and  Shinar  (2003)  observed 
a  similar  asymmetry  in  their  laboratory  data. 


Furthermore, false alarms are well known to cause
annoyance, to lead to unnecessary evasive actions,
and, in the worst-case scenario, to lead to sufficient
distrust of the automated system that true
alarms are ignored - the “cry wolf” syndrome
(Breznitz, 1983; Parasuraman & Riley, 1997;
Sorkin,  1989).  Despite  such  evidence,  it  is  impor¬ 
tant  to  note  that  in  many  situations,  misses  may  be 
more  costly  than  false  alarms  (e.g.,  air  traffic  con¬ 
trol)  and  that  experts  may  be  more  accepting  of 
false  alarms  than  of  misses  (Masalonis  &  Para¬ 
suraman,  1999). 

Reliance  Versus  Compliance 

The  qualitative  distinction  between  the  two 
kinds  of  diagnostic  imperfections  is  important 
because  of  the  recent  dichotomization  of  two  very 
different  cognitive  states  -  reliance  and  compli¬ 
ance  -  that  are  associated  with  automated  diag¬ 
nostic  systems  committing  one  or  the  other  type  of 
error,  particularly  under  conditions  of  high  work¬ 
load  (Maltz  &  Shinar,  2003;  Meyer,  2001, 2004). 
We  consider  these  two  states  to  be  two  different 
manifestations  of  automation  dependence,  a  de¬ 
pendence  that  will  be  inversely  related  to  automa¬ 
tion  reliability  in  resource-scarce  circumstances. 
Here  reliance  refers  to  the  human  operator  state 
when  the  alert  is  silent,  signaling  “all  is  well.” 
Reliant  operators  will  have  ample  resources  to 
allocate  to  concurrent  tasks  because  they  rely  on 
the  automation  to  let  them  know  when  a  problem 
occurs  on  the  automated  task.  Miss-prone  automa¬ 
tion  will  degrade  reliance,  particularly  under  high 
workload,  and  as  a  result  should  lead  to  decre¬ 
ments  in  concurrent  tasks.  In  forcing  the  operator 
to  pay  closer  attention  to  the  raw  data  of  the  alert¬ 
ed  domain,  there  should  be  more  effective  detec¬ 
tion  of  those  (now  more  frequent)  misses  made 
by  the  automation  system.  Conversely,  highly 
reliable,  low-miss  automation,  although  availing 
ample  resources  for  concurrent  tasks,  should  leave 
the  operator  quite  vulnerable  to  the  rare  automation 
misses  during  high  workload  -  the  “complacency” 
effect  (Bainbridge,  1983;  Molloy  &  Parasuraman, 
1996;  Parasuraman  et  al.,  1993). 

In  contrast,  compliance  describes  the  operator’s 
response  when  the  alarm  sounds,  whether  true  or 
false.  A  compliant  operator  will  rapidly  switch 
attention  from  concurrent  activities  to  the  alarm 
domain  (and  possibly  immediately  initiate  an 
alarm-appropriate  response,  such  as  leaving  the 
building upon hearing a fire alarm). Automation
that  is  prone  to  false  alarms  will  degrade  compli¬ 
ance,  the  consequences  of  which  are  a  delayed 
response  (or  possibly,  no  response  at  all)  to  a  true 
alarm  (Breznitz,  1983;  Sorkin,  1989). 

Although  the  research  of  Meyer  (2004)  has 
suggested  that  the  two  may  be  somewhat  inde¬ 
pendent  states,  with  separate  factors  affecting  re¬ 
liance  and  compliance,  these  two  constructs  have 
not  been  separately  and  quantitatively  evaluated 
within  a  multitask  context  in  which  resources  are 
scarce  and  the  threshold  of  an  alarm  system  is  sys¬ 
tematically  varied  to  alter  the  two  types  of  automa¬ 
tion  errors.  Maltz  and  Shinar  (2003)  imposed  such 
variation  but  did  so  within  a  single-task  context 
in  which  resource  demand  (and  automation  depen¬ 
dence)  was  created  by  a  more  demanding  task. 
Unfortunately,  no  study  of  imperfect  alert  automa¬ 
tion  has  systematically  varied  the  threshold  of  the 
alert  system  within  a  dual-task  context,  where 
the  consequences  of  allocating  resources  to  the 
secondary  task  can  be  assessed. 

The  Current  Study 

Because  of  its  inherent  multitask  nature  (Dix¬ 
on  et  al.,  2005)  and  ecologically  valid  properties, 
UAV  simulation  provided  an  ideal  test  bed  for  two 
experiments  that  examined  the  issues  of  imperfect 
automation  in  dual-task  settings.  In  both  experi¬ 
ments,  participants  conducted  simulated  recon¬ 
naissance  missions  in  which  they  were  responsible 
for navigating a UAV to 10 different command
targets  and  for  reporting  details  of  those  targets 
to  mission  command  (Dixon  et  al.,  2005).  This 
was  considered  the  primary  task.  Simultaneously, 
they  were  required  to  search  for  possible  targets 
of opportunity (TOOs) along the way. Upon detecting
a target, participants engaged in a high-workload camera
zoom and inspect task. This was considered
the  secondary  or  concurrent  task,  upon  which  the 
hypothesized  effects  of  reliance  could  be  ob¬ 
served.  Participants  also  had  to  monitor  on-board 
system  parameters  for  possible  failures.  This  was 
considered  the  imperfect  diagnostic  automation 
task  supporting  the  primary  task,  given  that  an  au¬ 
ditory  automation-alert  aid  was  sometimes  avail¬ 
able  to  indicate  when  these  system  failures  had 
occurred. 

In  Experiment  1,  this  aid  was  either  perfectly 
reliable  or  67%  reliable  (producing  either  false 
alarms or misses in two different conditions). A
fourth condition, with no automation, provided
baseline  data  with  which  to  compare  these  auto¬ 
mated  conditions.  In  Experiment  2,  participants 
were  assisted  by  the  same  autoalert  aid  but  with 
reliability  levels  of  80%  (producing  both  a  false 
alarm and a miss) and two conditions of 60% (producing
both false alarms and misses at a 3:1 ratio,
and  vice  versa).  The  multiple  levels  of  automation 
reliability  achieved  by  varying  miss  and  false 
alarm  rate  independently  across  the  two  experi¬ 
ments  provide  data  to  validate  a  computational 
model  of  dependence  on  imperfect  automation. 
More  specifically,  our  experiments  address  four 
hypotheses: 

H1: The symptoms of automation dependence
(benefits  if  correct,  costs  if  incorrect)  will  emerge 
primarily  at  high  workload.  Automation  imper¬ 
fection  driven  by  misses  and  false  alarms  would 
show  qualitatively  different  effects  as  reflected  by 
measures  of  reliance  and  of  compliance  respec¬ 
tively. 

H2:  Indices  of  high  reliance  will  decrease  with 
increasing  miss  rate.  High  reliance  is  indicated  by 
good  target-of-opportunity  performance  and  com¬ 
mand  target  memory  and  also  by  a  slow  response 
to  the  rare  system  failure  miss  when  automation  is 
reliable. 

H3:  Indices  of  high  compliance  will  decrease 
with  increasing  false  alarm  rate.  High  compliance 
is  indicated  by  rapid  and  accurate  responses  to  all 
alerts,  whether  true  or  false. 

H4:  The  two  vectors  of  reliance  and  compli¬ 
ance  will  show  relative  independence  from  each 
other. 

METHODS:  EXPERIMENT  1 
Participants 

Thirty-two  undergraduate  and  graduate  stu¬ 
dents  received  $8/hr,  plus  bonuses  of  $20,  $10,  and 
$5,  for  first-,  second-,  and  third-place  finishes,  re¬ 
spectively,  out  of  groups  of  8  participants.  Partici¬ 
pants  were  made  aware  of  the  incentives  and  told 
how  the  overall  task  performance  would  be  cal¬ 
culated.  Twenty  of  the  participants  were  licensed 
pilots,  who  were  equally  distributed  across  con¬ 
ditions. 

Apparatus 

The  experimental  simulation  ran  on  an  Evans 
and  Sutherland  SimFusion  4000q  system.  The  UAV 
display was generated by an OPENsim graphics
card and shown on a Hitachi CM721F 19-inch (48-cm)
monitor, using 1280 x 1024 resolution. Figure 1
presents a sample display for a single UAV.

As  shown  in  Figure  1,  the  experimental  en¬ 
vironment  was  subdivided  into  four  separate 
windows.  The  top  left  window  contained  a  3-D 
egocentric  image  view  of  the  terrain  directly  be¬ 
low  the  UAV  (6000  feet  altitude).  During  regular 
tracking  periods,  the  operator  could  only  view 
straight  down  to  the  ground.  During  a  loiter  pat¬ 
tern,  the  operator  was  able  to  zoom  and  to  extend 
the  viewing  angle  from  0°  to  90°  along  both  the 
x  and  y  axes.  The  bottom  left  window  contained 
a  2-D  top-down  map  of  the  20  x  20  mile  (32  x  32 
km)  simulation  world.  Coordinates  from  0°  to  90° 
were  placed  along  the  x  and  y  axes  for  navigation 
purposes.  The  bottom  center  window  contained 
the  message  box,  with  “fly  to”  coordinates  and 
command  target  (CT)  report  questions.  These 
flight  instructions  were  present  for  15  s  and  could 
be  refreshed  for  another  15  s  at  any  time  during 
the  mission  by  pressing  a  “repeat”  key.  The  bottom 
right  window  contained  the  four  system  gauges 
for  the  system  failure  monitoring  task.  The  white 
bars  oscillated  up  and  down  continuously,  each 
driven  by  sine  waves  ranging  in  bandwidth  from 
0.01 Hz to 0.025 Hz. A system failure occurred
when one of the white bars moved into a red zone,
indicated in gray at the tops and bottoms of the
gauges in Figure 1. Participants used a Logitech
Digital 3-D joystick to manipulate the aircraft/camera
and an X-Key 20-button keypad to indicate
responses.

Procedure 

Each  participant  flew  one  UAV  through  10 
consecutive  mission  legs.  During  each  leg,  the 
participant  completed  three  goal-oriented  tasks 
that  are  commonly  associated  with  UAV  flight 
control:  mission  navigation  and  command  target 
inspection,  target  of  opportunity  (TOO)  search, 
and  systems  monitoring.  At  the  beginning  of  each 
mission  leg,  participants  obtained  their  flight  in¬ 
structions  for  that  leg  via  the  message  box.  Once 
participants  arrived  at  the  CT  location,  they  loitered 
around  the  target,  manipulated  a  camera  for  closer 
target  inspection  via  a  joystick,  and  reported  back 
relevant  information  to  mission  command  (e.g., 
“What  weapons  are  located  on  the  south  side  of 
the  building?”). 

Figure 1. UAV simulation display. The actual display was larger, had better resolution, and was color-coded.

Along each mission leg, participants were also
responsible for detecting and reporting the low-salience
TOOs, a task similar to the CT report
except that the TOOs were much smaller (1°-2°
of visual angle) than the CT report objects and were
camouflaged. This was considered the secondary
or  concurrent  task.  TOOs  were  located  random¬ 
ly  somewhere  in  the  middle  60%  of  each  leg;  if  a 
TOO  was  found,  a  report  response  with  zooming 
and  panning  was  required,  much  like  the  CT  re¬ 
port.  TOOs  could  become  visible  during  simple 
tracking  (low  workload)  or  during  a  participant  re¬ 
sponse  to  a  system  failure  (high  workload).  These 
two  types  of  TOOs  occurred,  respectively,  with  a 
ratio  of  roughly  4:1.  Upon  making  a  TOO  report, 
the  UAV  was  reoriented  by  the  pilot  to  continue 
its  original  trajectory  toward  the  command  target. 

Concurrently,  participants  were  also  required 
to  monitor  the  system  gauges  for  possible  system 
failures  (SFs),  which  were  designed  to  fail  during 
either  simple  tracking  (i.e.,  low  workload:  easy 
concurrent  task)  or  TOO  and  CT  zoom/loiter  in¬ 
spection  (i.e.,  high  workload:  difficult  concurrent 
task).  SFs  lasted  only  30  s,  after  which  the  screen 
flashed  bright  red  and  a  harsh  auditory  alarm  an¬ 
nounced  that  the  participant  had  failed  to  detect 
the  SF.  There  were  a  total  of  10  SFs,  with  no  more 
than  2  occurring  during  any  mission  leg.  SFs  were 
temporally  separated  by  4  to  10  min.  Some  SFs 
were  alerted  with  an  automated  auditory  warning 
system  (i.e.,  a  tone). 

Design 

The auditory autoalerts for the SFs were provided
for three out of the four conditions, using a
between-subjects design (8 participants/group).
The A100 condition (A = automation, 100% reliable)
provided 10 true alarms with 10 SF events. The
A67f condition (f = false alarm, 67% reliable) provided
10 true alerts and an additional 5 false
alarms. The A67m condition (m = miss, 67% reliable)
provided 10 true alerts but failed to alert an
additional 5 events (10 true alarms plus 5 misses).
During  a  false  alarm,  the  participant  was  instruct¬ 
ed  to  ignore  the  warning  after  cross-checking 
with  the  raw  data  to  confirm  the  inaccuracy  of  the 
alarm.  If  an  automation  miss  occurred,  the  partic¬ 
ipant  was  instructed  that  he  or  she  was  still  respon¬ 
sible  for  “catching”  the  SF  and  correcting  it.  The 
final  condition  was  a  baseline  condition,  with  no 
automation  aid  to  assist  participant  performance. 
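
The reliability levels of the three automated conditions can be recovered from the alert counts given above. The short Python sketch below assumes that reliability is the proportion of true alerts among all true alerts, false alarms, and misses; this definition is one plausible reading of the design rather than a formula stated in the method, and the function and variable names are hypothetical.

```python
def alert_reliability(true_alerts, false_alarms, misses):
    """Proportion of automation outputs that were correct.

    Assumed definition: true alerts / (true alerts + false alarms + misses).
    """
    events = true_alerts + false_alarms + misses
    if events == 0:
        raise ValueError("at least one automation event is required")
    return true_alerts / events


# Alert counts per condition as described in the Design section:
# (true alerts, false alarms, misses).
conditions = {
    "A100": (10, 0, 0),
    "A67f": (10, 5, 0),
    "A67m": (10, 0, 5),
}

for name, counts in conditions.items():
    print(name, round(alert_reliability(*counts), 2))
```

Under this reading, the A67f and A67m conditions are matched on overall reliability (10 of 15 events, or about 67%) and differ only in the type of automation error.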

RESULTS:  EXPERIMENT  1 

Three planned comparisons were used throughout
to assess statistical effects. For each dependent
measure, the following were compared: (a) baseline
versus the combination of A67f and A67m in
a  planned  comparison  (i.e.,  weights  of  -1, 0.5, 0.5); 
(b)  baseline  versus  A100;  and  (c)  A67f  versus 
A67m.  Because  only  three  a  priori  comparisons 
were  implemented  to  view  important  differences 
between  particular  groups  of  interest,  familywise 
error  rates  were  not  adjusted  (see  Keppel,  1982, 
for  more  details).  One  participant  in  the  baseline 
condition  was  dropped  because  the  data  file  was 
corrupted.  Note  that  because  of  frequent  missing 
data  points  (e.g.,  if  a  target  does  not  come  into 
view  on  the  3-D  display,  then  a  participant  has  no 
chance  to  detect  it;  or  if  a  participant  does  not  de¬ 
tect  a  target,  then  there  are  no  data  for  the  target 
detection  times),  the  degrees  of  freedom  in  the  fol¬ 
lowing  comparisons  are  sometimes  less  than  the 
maximum  value.  Table  1  presents  the  data. 
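
For readers who wish to reproduce this style of analysis, the following Python sketch computes a planned contrast t statistic for independent groups. It assumes that the error term is pooled only over the groups entering the contrast, an assumption that is consistent with the degrees of freedom reported below (e.g., 7 + 8 + 8 - 3 = 20) but is not stated explicitly in the text. The function name and the scores are hypothetical and are not data from the experiment.

```python
from math import sqrt
from statistics import mean, variance


def planned_contrast(groups, weights):
    """Return (t, df) for a planned contrast over independent groups.

    groups  -- list of lists of scores, one list per condition
    weights -- contrast weights, one per condition, summing to zero
    """
    if len(groups) != len(weights):
        raise ValueError("one weight per group is required")
    if abs(sum(weights)) > 1e-9:
        raise ValueError("contrast weights must sum to zero")
    df = sum(len(g) - 1 for g in groups)
    # Pooled within-group variance over the groups in the contrast.
    mse = sum((len(g) - 1) * variance(g) for g in groups) / df
    estimate = sum(w * mean(g) for w, g in zip(weights, groups))
    std_error = sqrt(mse * sum(w * w / len(g) for w, g in zip(weights, groups)))
    return estimate / std_error, df


# Hypothetical scores: 7 baseline participants, 8 in each 67% condition.
baseline = [3.1, 2.4, 3.8, 2.9, 3.3, 2.7, 3.0]
a67f = [3.2, 2.8, 3.5, 2.6, 3.4, 3.0, 2.9, 3.1]
a67m = [6.1, 5.8, 7.2, 6.6, 6.9, 5.9, 6.4, 7.0]

# Comparison (a): baseline versus the mean of the two 67% conditions.
t_stat, df = planned_contrast([baseline, a67f, a67m], [-1, 0.5, 0.5])
print(f"t({df}) = {t_stat:.2f}")
```

Comparisons (b) and (c) follow from the same function by passing only the two groups involved with weights of -1 and 1.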

Primary  Task:  Mission  Navigation  and  CT 
Inspection 

Tracking error and CT reporting. Planned comparisons
revealed no main effect for tracking error
(all ps > .10) or for CT reporting speed and accuracy.
Participants clearly treated mission navigation
and CT inspection as the primary task.

Repeats. Planned comparisons revealed that
the 67% reliable conditions (mean of A67f and
A67m) did not statistically differ from baseline,
t(20) = 1.49, p > .10. There was also no significant
difference between the A100 condition and
baseline, t(13) < 1.0. However, the A67m condition
generated twice as many repeats as did the A67f
condition, t(14) = 2.52, p = .01. Thus, miss-prone
automation imposed more of a load on memory,
which was compensated for by the repeat key, relative
to false-alarm-prone automation.

Secondary  Task:  TOO  Monitoring 

TOO detection rates. Planned comparisons
revealed no significant difference between the
baseline condition and the 67% reliable conditions,
t(18) < 1.0, or the A100 condition, t(12) <
1.0; however, detection rates were significantly
lower in the A67m (miss) condition than in the
A67f (false alarm) condition in both the low-workload,
t(12) = 2.25, p < .05, and high-workload
trials, t(12) = 2.20, p < .05.

TOO detection times. Because low-workload
trials revealed no effects of condition on TOO detection
times (all ps > .10), we focused primarily
on high-workload trials, when participants were
concurrently dealing with an SF, and resources
were assumed to be scarce. Planned comparisons
revealed no statistical difference between the mean
of the 67% reliable conditions relative to baseline,
t(11) < 1.0, or the A100 condition relative to
baseline, t(10) < 1.0. However, the A67f condition
may have generated longer detection times
than the A67m condition did, t(6) = 1.40, p = .10
(approaching significance).

TABLE 1: An Overview of the Data from Experiment 1

Measure                          Baseline        A100            A67f            A67m
Tracking error (MAE in meters)   84.25 (0.81)    83.80 (0.69)    79.32 (4.61)    83.08 (1.03)
Number of repeats (per leg)      3.03 (0.82)     2.25 (0.48)     3.04*** (0.67)  6.5*** (1.20)
CT detection time (s)            2.45 (0.80)     2.41 (0.51)     2.31 (0.31)     3.37 (1.07)
TOO detection rate (%)           58.57 (6.7)     56.57 (1.3)     65.56** (6.1)   41.25** (9.0)
TOO detection time (s)
  High workload                  6.03 (1.99)     7.83 (0.96)     13.82* (3.08)   7.7* (2.07)
  Low workload                   6.04 (0.91)     5.32 (1.0)      5.38 (0.96)     6.59 (3.1)
SF detection rate (%)
  Low load                       100.0 (0.0)     100.0 (0.0)     94.46 (4.2)     97.92 (1.4)
  High load                      95.83 (4.2)     88.0 (7.1)      68.75** (7.8)   92.19** (5.2)
SF detection time (s)
  Low load                       2.17 (0.35)     3.00 (0.71)     2.69 (1.19)     3.15 (0.61)
  High load                      10.75** (3.51)  3.21** (0.52)   11.0 (2.34)     13.75 (2.06)
SF report accuracy (%)           88.36 (3.0)     91.22 (2.2)     96.58 (1.3)     96.67 (1.1)

Note. SE values are in parentheses. MAE = mean absolute error, CT = command target, TOO = target of opportunity, SF = system
failure.

*p < .10. **p < .05. ***p < .01.

SF  Monitoring 

SF detection rates. The main focus of interest
in the SF task was during high-workload trials
(i.e., concurrent with TOO inspection), when
resources were assumed to be scarce, as low-workload
trials showed no effects (H1). Planned
comparisons revealed that the 67% reliable conditions
resulted in poorer detection rates than did
the baseline condition, t(19) = 1.97, p = .06
(approaching significance); however, these effects
were probably attributable to the A67f condition
(69%), in which performance was much worse
than in the A67m condition (92%), t(14) = 2.32,
p < .05. The A100 condition did not differ statistically
from baseline, t(10) < 1.0.

SF detection times. As with SF detection rates,
the only effects were observed in high-workload
trials. Figure 2 presents the overall SF detection
times as a function of workload and reveals that
performance in the 67% reliable conditions was
no better than in the baseline condition, t(20) < 1.0,
whereas performance in the A100 condition was
better than in baseline, t(11) = 1.96, p < .05. The
A67f and A67m conditions did not differ statistically
overall, t(14) < 1.0. However, it is interesting
to note that the A67m condition resulted in detection
times slower than those in the A67f condition
on those occasions when the automation failed to
notify the participants of an SF, t(14) = 2.64, p <
.01, reflecting a form of complacency, as shown
by the bars at the right in Figure 2. Furthermore,
there was a tendency for faster response times
(RTs), compared with the A67f condition, on
those occasions when the alarm sounded.

Figure 2. SF detection times across condition and workload, Experiment 1. The A67m condition is divided into two
subgroups: (a) automation true alerts (67% of the time) and (b) automation misses (33% of the time). SE bars are
included.

DISCUSSION:  EXPERIMENT  1 

Participants  were  effective  at  protecting  the 
primary  task  indices  of  tracking  and  CT  report 
accuracy. As hypothesized (H1), automation reliability
effects were also seen most strongly in
high-workload  situations.  Perfect  automation  had 
a  beneficial  effect,  relative  to  baseline,  on  perfor¬ 
mance  in  the  automated  task,  but  it  had  no  bene¬ 
fit  on  concurrent  task  performance,  replicating 
Metzger  and  Parasuraman  (2005)  and  previous 
UAV  studies  (e.g.,  Dixon  et  al.,  2005).  Impor¬ 
tantly,  imperfect  automation  (67%)  hurt  both  the 
automated  task  and  concurrent  tasks,  even  drop¬ 
ping  these  below  baseline  in  some  cases.  False 
alarms  and  misses  yielded  qualitatively  different 
kinds  of  effects  related  to  compliance  (H3)  and 
reliance  (H2),  respectively.  False  alarms  hurt  the 
system-monitoring task by reducing SF detection
rates  and  increasing  SF  detection  times  as  com¬ 
pared  with  baseline.  This  indicates  that  the  oper¬ 
ators  were  less  compliant  with  the  autoalerts 
(reduced  compliance).  Misses  hurt  performance 
in  remembering  flight  instructions  and  possibly  in 
the target search task, indicating a reduction in reliance.
We discuss these effects in more detail following
the presentation of converging evidence
provided by Experiment 2.

METHODS:  EXPERIMENT  2 

The  procedures  of  Experiment  2  replicated 
those  of  Experiment  1  with  the  following  excep¬ 
tions:  No  baseline  condition  was  run.  An  A80  con¬ 
dition  (A  =  automation,  80%  reliable)  failed  by 
giving  1  false  alarm  and  1  miss  during  each  mis¬ 
sion  (8  true  alarms,  1  miss,  and  1  false  alarm). 
These  2  automation  failures,  occurring  out  of  a 
possible  10  alerted  system  failures,  defined  a  .80 
reliability level (1 - 2/10). An A60f condition (f =
false  alarm,  60%  reliable)  was  created  by  impos¬ 
ing  3  automation  false  alarms  and  1  automation 
miss  (4  automation  failures)  out  of  the  10  possible 
system  failures.  An  A60m  condition  (m  =  miss, 
60%  reliable)  resulted  in  3  misses  and  1  false  alarm 
(6  true  alarms  plus  3  misses  and  1  false  alarm). 
Participants  were  not  aware  of  the  precise  level 
of  reliability  provided  by  each  automation  aid; 
however, in contrast to Experiment 1, depending
on their assigned condition, participants were
told in advance that the automation was either
“fairly reliable” or “not very reliable” and were
told the bias setting of the alert (i.e., more false alarms
or more misses). There were 24 participants
(8/group),  none  of  whom  participated  in  Experi¬ 
ment  1.  Participants  were  of  the  same  demo¬ 
graphics  as  those  in  Experiment  1,  including  the 
same  proportion  of  pilots  to  nonpilots. 

RESULTS: EXPERIMENT 2

Because  of  the  between-subjects  design  and 
the  close  temporal  proximity  of  the  two  experi¬ 
ments,  the  baseline  data  for  Experiment  1  were 
used  in  the  data  analysis  of  Experiment  2  as  well. 
Table  2  presents  an  overview  of  the  data.  As  with 
Experiment  1,  statistical  inference  was  based  on 
planned  contrasts  of  baseline  versus  60%  relia¬ 
bility  (mean  of  A60f  and  A60m),  baseline  versus 
A80,  and  A60f  versus  A60m. 

Mission  Completion 

Planned comparisons revealed no main effect
for tracking error or for CT report accuracy (all
ps > .10), findings consistent with Experiment 1.
However, planned comparisons did reveal that
for CT detection times (i.e., how long it took participants
to detect the CT once it entered the 3-D
display), performance in the two 60% reliable
conditions was worse than baseline, t(20) = 2.77,
p < .05, whereas the A80 condition did not differ
from baseline, t(13) < 1.0. There was no statistical
difference between the A60f and A60m conditions,
t(14) < 1.0. Compared with baseline, both
the 60% reliable conditions, t(19) = 2.49, p < .05,
and the A80 condition, t(11) = 1.72, p = .06 (approaching
significance), generated more repeats.
The A60m condition generated significantly
more repeats than did the A60f condition, t(14) =
1.85, p < .05.

TOO  Monitoring 

TOO  detection  rates.  Planned  comparisons 
revealed  that  there  was  no  difference  between  the 
60%  reliable  conditions  and  baseline,  f(20)  =1.17, 
p  >  .10,  whereas  performance  in  the  A80  condi¬ 
tion  was  better  than  baseline,  ?(13)  =  2.15,/?<  .05. 


TABLE 2: An Overview of the Data from Experiment 2

Measure                           Baseline          A80               A60f              A60m
Tracking error (MAE in meters)    84.25 (0.81)      84.45 (1.95)      82.75 (5.11)      85.76 (1.92)
Number of repeats (per leg)       3.03** (0.82)     5.57* (1.72)      5.25** (1.65)     8.5** (1.59)
CT detection time (s)             2.45** (0.80)     1.96 (1.07)       4.16** (1.10)     4.11** (1.84)
TOO detection rate (%)            58.57** (6.7)     93.0** (7.4)      87.0 (7.1)        82.0 (7.2)
TOO detection time (s)
  High workload                   6.03 (1.99)       8.58 (2.82)       14.72*** (2.63)   11.86*** (5.51)
  Low workload                    6.04 (0.91)       5.94 (1.28)       6.68 (1.20)       5.89 (1.24)
SF detection rate (%)
  Low load                        100.0 (0.0)       100.0 (2.8)       97.0 (2.7)        98.0 (2.7)
  High load                       95.83 (4.2)       69.0 (19.7)       50.0 (53.0)       75.0 (26.0)
SF detection time (s)
  Low load                        2.17 (0.35)       2.08 (0.71)       2.50 (0.19)       3.15 (0.19)
  High load                       10.75 (3.51)      11.27 (3.31)      19.98** (3.19)    13.62** (3.20)
SF report accuracy (%)            88.36 (3.0)       97.0 (4.8)        98.0 (4.4)        94.0 (5.0)

Note. SE values are in parentheses. MAE = mean absolute error, CT = command target, TOO = target of opportunity, SF = system
failure.

*p < .10. **p < .05. ***p < .01.
There was no significant difference between the
A60f and the A60m conditions, t(14) < 1.0.

TOO  detection  times.  Figure  3  presents  TOO 
detection  times  as  a  function  of  condition  and 
workload.  On  low-workload  trials,  there  were  no 
effects  of  condition  (all  ps  >  .10). 

In high-workload trials, planned comparisons
revealed that performance in the 60% reliable
conditions was worse than baseline, t(16) = 3.09,
p < .01, but there was no difference between the
A80 condition and baseline, t(12) < 1.0. A comparison
of the A60f and A60m conditions revealed
no significant difference, t(11) = 1.04, p > .10,
although the trend toward greater decrement with
the A60f condition is consistent with that observed
in Experiment 1.

SF  Monitoring 

SF detection rates. There were no statistical effects
of condition on SF detection rates (all ps >
.10); however, the reduced rates in the A60f condition
in high workload (50%), as compared with
the other conditions (mean = 74%), are consistent
with those observed in Experiment 1.

SF detection times. Figure 4 presents the SF detection
times as a function of condition and workload.
No differences in performance were revealed
in the low-workload trials; however, in the high-workload
trials, performance in the 60% reliable
conditions may have been worse than baseline,
t(20) = 1.89, p = .07 (approaching significance).
This difference was attributable primarily to the
A60f condition, in which performance was worse
than in the A60m condition, t(14) = 2.16, p < .05.
The A80 condition did not differ from baseline,
t(13) < 1.0.

In  Figure  4,  we  note  that  each  of  the  60%  con¬ 
dition  means  was  composed  of  two  different 
components:  responses  when  an  alert  correctly 
sounded  (A60f  =  13.93  s;  A60m  =  3.96  s)  and 
those  when  the  alert  failed  to  sound  (A60f  = 
26.05  s;  A60m  =  23.29  s).  These  data  within  the 
high-workload  condition  reveal  the  clear  slowing 
for  RT  when  the  alarm  “missed”  the  SF  event, 
indicating  that  in  both  conditions  participants  had 
relied  heavily  upon  the  automation  and  their 
detection  suffered  when  it  failed:  the  classic 
“complacency”  effect  (Parasuraman  et  al.,  1993). 
Although  this  complacency  effect  was  less  pro¬ 
nounced  in  the  miss-prone  condition,  the  differ¬ 
ence  between  the  two  error  conditions  did  not 
approach significance. Correct alerts were responded
to more rapidly with the miss-prone
automation than with the false-alarm-prone automation,
t(14) = 2.00, p < .05, reflecting the participants’
immediate compliance with the auditory
alert  (Meyer,  2001,  2004)  in  the  former  condi¬ 
tion,  in  contrast  to  the  false-alarm-prone  condition, 
in  which  participants  were  less  likely  to  interrupt 
target  inspection  to  deal  with  the  alarms.  We  also 
infer  that  greater  compliance  in  the  miss-prone 
condition  was  coupled  with  an  ongoing  greater 
awareness  of  the  SF  gauges,  fostered  by  a  reduced 
reliance  on  that  automation  and  causing  the 
greater  disruption  to  CT  memory  recall  described 
previously.  The  difference  between  reliance  and 
compliance  effects  will  be  explored  later  in  greater 
detail. 
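
As a small arithmetic check on the decomposition described above, the sketch below averages the two reported component RTs for each 60% condition and compares the result with the high-load SF detection times in Table 2 (19.98 s for A60f, 13.62 s for A60m). That the two components were weighted equally is our inference from the arithmetic, not something the text states.

```python
# Component RTs (s) reported in the text for high-workload SF detection.
components = {
    "A60f": {"alert_sounded": 13.93, "alert_missed": 26.05},
    "A60m": {"alert_sounded": 3.96, "alert_missed": 23.29},
}

# High-load SF detection times (s) from Table 2.
table_means = {"A60f": 19.98, "A60m": 13.62}


def unweighted_mean(parts):
    """Average the component RTs with equal weights (an assumption)."""
    return sum(parts.values()) / len(parts)


for condition, parts in components.items():
    combined = unweighted_mean(parts)
    gap = abs(combined - table_means[condition])
    print(f"{condition}: combined = {combined:.3f} s, table = {table_means[condition]} s, gap = {gap:.3f} s")
```

In both conditions the equal-weight average lands within rounding of the tabled mean.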


Figure 3. Experiment 2: TOO detection times across condition and workload. SE bars are included.

Figure 4. Experiment 2: SF detection times across condition and workload. SE bars are included.


DISCUSSION:  EXPERIMENT  2 

As  with  Experiment  1,  the  primary  tasks  of 
tracking  and  CT  reporting  were  fully  protected 
from  the  effects  of  degraded  reliability,  although 
degraded  reliability,  particularly  that  prone  to 
misses,  induced  more  use  of  the  repeat  key  to  com¬ 
pensate  for  the  attention  demands  of  this  degrada¬ 
tion.  The  effects  on  other  tasks  were  primarily 
seen in high-workload situations. Highly reliable
automation  did  not  benefit  performance  in  the  au¬ 
tomated  task  relative  to  baseline,  but  it  had  a  small 
benefit  to  concurrent  task  performance.  Low- 
reliability  automation  (60%)  hurt  both  the  auto¬ 
mated  task  and  concurrent  tasks,  with  different 
effects  for  false  alarms  and  misses  related  to  com¬ 
pliance  and  reliance. 

MODELING  OF  AUTOMATION 
DEPENDENCE 

The  current  simulation  results  provide  an  ideal 
opportunity  to  evaluate  a  computational  version 
of  the  model  of  reliance  and  compliance  (Meyer, 
2001),  the  two  components  of  diagnostic  auto¬ 
mation  dependence.  Within  each  condition  it  is 
possible  to  assess  measures  of  reliance  and  com¬ 
pliance: 

Reliance is indexed by (a) the performance on
secondary or concurrent tasks. Here TOO accuracy
and detection time (during non-SF periods,
when reliance was necessary) are examined, as is
frequency of use of the memory refresh repeat key
(higher reliance -> better performance and less
use of the memory repeat). (b) Reliance is also
indexed by the time required to respond to an unannounced
failure (e.g., RT to an automation
“miss”: higher reliance -> longer RT, reflecting
the “complacency effect” with highly reliable automation;
Molloy & Parasuraman, 1996). We
evaluated this latter measure only under high-workload
conditions, in which reliance is most
likely to be observed.

Compliance  is  indexed  by  the  response  time 
and  accuracy  to  an  announced  system  failure 
(higher  compliance  ->  shorter  RT),  again  under 
high  workload. 

To  the  extent  that  reliance  and  compliance  are 
components  of  automation  dependency,  and  that 
operators  are  perfectly  calibrated  to  true  reliabil¬ 
ities,  we  predicted  that  those  two  vectors  of  re¬ 
liance  and  compliance  performance  measures 
should  be  linearly  affected  by  the  independent 
variables  of  miss  rate  (H2)  and  false  alarm  rate 
(H3),  respectively.  Furthermore,  to  the  extent  that 
it  is  an  independent  component,  each  vector 
should  be  unaffected  by  the  other  independent 
variable  (H4). 

Examination of the data revealed that all four
measures of reliance showed a correlation in the
expected direction. SF automation miss rate correlated
with TOO miss rate, r = .50; RT to TOO,
r = .73; repeats, r = .76; and RT to SF misses, r =
-.97. That is, a higher miss rate -> less reliance
-> poorer concurrent task performance and faster
response to the automation miss (Meyer, 2001;
Parasuraman et al., 1993).

The two measures of SF alert compliance were
assessed at high workload, when the participants’
attention was heavily engaged in manipulating the
3-D image to inspect targets (and participants
therefore might be more reluctant to leave the image
inspection task and switch to the alerted system display).
Here again, the correlations were in the expected
direction. The correlation of automation FA rate
with RT to SF was r = .37; with SF miss rate it
was r = .73. That is, a higher FA rate -> less compliance
-> slower and less accurate response to
the  SF  alerts.  Here  also,  as  with  one  of  the  TOO 
reliance  measures,  a  closer  model  fit  was  thwart¬ 
ed  by  an  Experiment  1  data  point  where,  for  FA  = 
5  (A67f  from  Experiment  1),  performance  was 
better  (more  compliance)  than  one  might  other¬ 
wise  predict  from  the  skeptical  participant  who  is 
mistrustful  of  a  false-alarm-prone  system.  By  way 
of explanation, we note that in Experiment 1, participants
were not prealerted to the high false
alarm  rate.  Hence  it  would  have  taken  a  few  tri¬ 
als  for  the  lack  of  compliance  to  evolve,  thereby 
diluting  the  effect. 

Hypothesis  4  posits  the  independence  of  com¬ 
pliance  from  miss  rate  and  of  reliance  from  false 
alarm  rate.  To  assess  this,  we  correlated  miss  rate 
to  the  two  indices  of  compliance.  The  correlations 
were  r(3)  =  .29,  p  =  .33  (SF  RT),  and  r(3)  =  -.33, 
p  =  .17  (SF  miss  rate),  supporting  such  a  model 
of  independence.  The  correlations  of  FA  rate  to 
the four indices of reliance were r(2) = .92, p = .08
(RT to SF miss), r(3) = -.69, p = .18 (TOO miss
rate), r(3) = -.16, p = .14 (RT to TOO), and r(3) =
-.10, p = .27 (repeats). The former two values,
though  not  significant,  suggest  that  reliance  may 
have  been  affected  by  the  false  alarm  rate.  High 
false  alarm  rates  appear  to  have  produced  greater 
reliance  upon  the  automation,  although  this  claim 
cannot  be  proven  with  the  current  data. 

Because  all  individual  correlations  were  based 
on  a  small  N,  we  examined  Hypotheses  2  through 
4  in  a  different  way  to  increase  statistical  power. 
Each  variable  was  standardized  and  expressed  as 
a  proportion  of  the  range  between  minimum  and 
maximum  observed  value.  These  standardized 
values  were  inverted  where  necessary,  such  that 
changes  in  all  variables  within  a  vector  that  were 
associated  with  increases  in  reliance  or  com¬ 
pliance  were  of  the  same  sign.  The  standardized 
variables  within  each  vector  were  then  pooled. 
Correlations on the pooled data revealed the following:
miss rate -> reliance (r = .67, p < .01); miss rate ->
compliance (r = .07, ns); FA rate -> reliance (r =
-.50, p = .06); FA rate -> compliance (r = .49,
p = .11). As we will discuss, this pattern is only
partially  consistent  with  the  independence  hypoth¬ 
esis,  because  higher  false  alarm  rates  appear  to 
have  an  influence  on  reliance. 
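
The pooling procedure described in the preceding two paragraphs (range standardization, inversion where needed, pooling within a vector, then correlation) can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the data values, the measure names, and the helper names `min_max`, `pearson_r`, and `pooled_correlation` are all hypothetical, and the sign convention chosen for inversion is ours.

```python
from math import sqrt


def min_max(values):
    """Express each value as a proportion of the observed range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pooled_correlation(predictor, measures):
    """Correlate a predictor with a pooled vector of standardized measures.

    `measures` maps a name to (values, invert); invert=True flips a measure
    so that every measure in the vector points in the same direction.
    """
    xs, ys = [], []
    for values, invert in measures.values():
        scaled = min_max(values)
        if invert:
            scaled = [1.0 - s for s in scaled]
        xs.extend(predictor)  # the predictor repeats once per pooled measure
        ys.extend(scaled)
    return pearson_r(xs, ys)


# Hypothetical condition-level values, NOT data from the study.
miss_rate = [0.0, 1.0, 2.0]
reliance_vector = {
    "measure_a": ([10.0, 20.0, 30.0], False),
    "measure_b": ([5.0, 3.0, 1.0], True),
}

r = pooled_correlation(miss_rate, reliance_vector)
print(f"pooled r = {r:.2f}")
```

Pooling multiplies the number of points entering the correlation by the number of measures in the vector, which is where the gain in statistical power comes from; the pooled points are not independent observations, however, because each condition contributes once per measure.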

GENERAL  DISCUSSION 

Prior  literature  has  well  established  that  perfect 
automation  will  offer  benefits  when  workload  is 
high,  either  because  the  task  being  automated 
is  challenging  (e.g.,  Maltz  &  Shinar,  2003)  or,  as 
in  the  current  case,  because  other  multitask  re¬ 
sponsibilities  are  competing  for  the  operators’ 
limited  attentional  resources  (C.  D.  Wickens  & 
Dixon, 2005). The current data confirmed this
effect, as A100 performance was superior to baseline
performance in the RT to system failures only
at high workload, supporting H1. Also, there are
now  ample  data  showing  that  people  depend  on 
automation  even  when  it  is  imperfect,  and  here  we 
found  in  the  A80  condition  that  benefits  were  still 
evident  over  baseline  performance  in  detecting 
TOOs,  just  as  such  benefits  have  been  observed 
in  other  studies  (e.g.,  Maltz  &  Shinar,  2003;  St. 
John  &  Manes,  2002;  Yaacov  et  al.,  2003). 

In  the  current  experiment,  we  were  particular¬ 
ly  interested  in  the  manifestations  of  this  depen¬ 
dence  when  the  reliability  dropped  still  further 
and,  in  particular,  how  it  was  reflected  in  the  two 
components of dependence, reliance and compliance,
articulated by Meyer (2001, 2004). We found
first,  in  support  of  Hypothesis  1,  that  dependence 
costs  emerged  more  markedly  under  high-workload 
than  under  low-workload  conditions.  This  was 
particularly  true  for  the  manifestations  of  compli¬ 
ance,  in  which  the  prolongation  of  RT  to  auditory 
alerts  with  the  false-alert-prone  system  was  ob¬ 
served  only  while  participants  were  concurrently 
engaged  in  TOO  and  CT  image  inspection  (high 
workload,  Figures  2  and  4),  and  only  in  this  con¬ 
dition  was  the  decrease  in  SF  detection  rate  evi¬ 
dent  (Experiment  1  only). 

We  also  found  support  for  Hypotheses  2  and  3 
when  examining  the  independent  effects  of  miss 
rate  on  reliance  and  false  alarm  rate  on  compli¬ 
ance,  respectively;  this  has  not  been  previously 
reported  in  a  multitask  experiment.  Our  approach 
was  through  creating  the  “vector”  measures  of 
each  construct.  Our  data  revealed  a  strong  effect 
of miss rate on reliance (r = .67), as participants
became  less  trusting  of  the  automation  to  alert 
them  if  a  failure  occurred  and  (a)  allocated  more 
attention  to  monitoring  the  raw  data  at  the  expense 
of  two  concurrent  tasks  (TOO  monitoring  and  CT 
coordinate  memory)  but  (b)  caught  the  rare  auto¬ 
mation  miss  of  the  system  failure  more  frequent¬ 
ly.  Correspondingly,  we  found  support,  although 
somewhat  weaker  (lower  correlation,  r  =  .49),  for 
the negative effects of high false alert rate on compliance,
reflecting the “cry wolf” phenomenon.

Hypothesis  4  concerns  independence,  which 
was  not  explicitly  framed  as  a  property  of  reliance 
and  compliance  by  Meyer  (2004)  but  has  indeed 
appeared  to  be  an  implication  of  his  research. 
Here,  however,  the  data  were  mixed.  Indeed,  miss 
rate  appeared  to  have  little  influence  on  the  vec¬ 
tor  of  compliance.  The  participants’  attention  was 
drawn  more  or  less  to  the  alert,  independent  of  the 
imperfection  of  that  alert  when  it  was  silent.  Puz¬ 
zling,  however,  was  the  influence  of  false  alert  rate 
on  reliance  (r  =  -.50),  which  was  just  as  strong  as 
its effect on compliance (r = .49). Upon closer
examination  of  the  components  of  the  reliance 
vector,  the  direction  of  this  effect  (more  false 
alarms  ->  less  reliance)  was  driven  heavily  by  the 
fact  that  more  false  alarms  increased  the  response 
time  to  the  rare  automation-missed  system  fail¬ 
ure.  In  this  regard,  it  appears  that  a  false-alarm- 
prone  system  may  leave  the  operator  somewhat 
less  inclined  to  pay  any  attention  to  the  entire  auto¬ 
mated  domain,  whether  it  be  its  alerting  signal  or 
the  raw  data  contained  within. 

As  a  final  observation,  we  note  the  general 
pattern  of  the  current  data:  Our  two  lowest  levels 
of  reliability  clearly  inhibited  performance  below 
baseline,  whereas  our  higher  level  of  imperfect  re¬ 
liability  (.80)  showed  general  improvements.  Such 
findings  are  consistent  with  the  recent  integration 
of  the  literature,  suggesting  that  reliability  levels 
of  70%  to  75%  represent  a  rough  “threshold”  of 
imperfect  reliability  assistance  (C.  D.  Wickens  & 
Dixon,  2005).  Although  not  all  studies  show  that 
reliability  levels  below  70%  are  worse  than  hav¬ 
ing  no  automation  at  all  (St.  John  &  Manes, 
2002),  the  majority  of  the  studies  examined  in  the 
literature  do  seem  to  indicate  that  this  may  be  an 
emerging  conclusion.  Furthermore,  this  may  have 
implications  for  other  domains  (outside  of  the 
UAV  arena)  that  use  diagnostic  alerts,  such  as  air¬ 
port  luggage  screening  and  air  traffic  control. 


Perhaps  the  most  important  implications  of 
the  current  results  go  beyond  those  specific  to 
UAVs  and  relate  to  the  general  implications  of  the 
designer’s  flexibility  in  setting  the  alerting  thresh¬ 
old  in  multitask  environments.  On  the  one  hand, 
by  extending  the  findings  of  Maltz  and  Meyer 
(2003),  these  results  reveal  profoundly  different 
effects  on  attention  allocation  and  attention 
switching,  the  ultimate  costs  of  which  must  de¬ 
pend  on  the  importance  of  ongoing  tasks  and 
alerting  tasks.  In  this  context,  attention  appears 
to  be  driven  by  the  cost  of  total  (human  and  sys¬ 
tem)  misses  versus  false  alarms.  On  the  other 
hand,  the  results  provide  some  promise  for  the 
development  of  computational  models  of  automa¬ 
tion  effects  that  can  be  employed  in  predicting 
human-automation  interaction. 